Generating the training and test sets involves running the ocbio.extract
module on the chosen gold-standard positive and negative datasets.
This notebook acts as a script for doing so, with documentation inline.
First, the datasource table must be regenerated in the top-level directory containing the data:
In [1]:
cd ../../
In [2]:
import csv
As the data repository has now been annexed, the datasource table must first be unlocked:
In [3]:
!git annex unlock datasource.tab
In [4]:
# This script should be updated to add new features when they become available.
# The with-block closes the file even if one of the writes fails.
with open("datasource.tab", "w") as f:
    c = csv.writer(f, delimiter="\t")
    # Gene Ontology features
    c.writerow(["Gene_Ontology", "Gene_Ontology", "generator=geneontology/testgen.pickle"])
    # Y2H SVM feature
    c.writerow(["Y2H/Y2H.txt", "Y2H/Y2H.db", "valindexes=(4);ignoreheader=1;zeromissing=1"])
    # ENTS feature
    c.writerow(["ENTS", "ENTS", "generator=ents/human.ENTS.features.pickle"])
    # ENTS summary feature
    c.writerow(["ENTS_summary", "ENTS_summary", "generator=ents/human.Entrez.ENTS.summary.pickle"])
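Before handing the table to the assembler, it can be useful to read the rows back and split the third column into key/value pairs. The following is a minimal sketch under stated assumptions: `parse_options` and `read_datasource_table` are hypothetical helper names, and reading the third column as `key=value` pairs separated by `;` is inferred from the rows written above. It is not the parser that ocbio.extract itself uses. The sketch targets Python 3.

```python
import csv
import io


def parse_options(option_string):
    """Split 'key=value;key=value' into a dict (assumed format, values kept as strings)."""
    options = {}
    for item in option_string.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        options[key] = value
    return options


def read_datasource_table(handle):
    """Return a list of (first_column, second_column, options) tuples.

    The meaning of the first two columns is left to ocbio.extract;
    this helper only passes them through unchanged.
    """
    rows = []
    for first, second, option_string in csv.reader(handle, delimiter="\t"):
        rows.append((first, second, parse_options(option_string)))
    return rows


# Demonstration on an in-memory copy of the Y2H row written above.
example = io.StringIO(
    "Y2H/Y2H.txt\tY2H/Y2H.db\tvalindexes=(4);ignoreheader=1;zeromissing=1\n"
)
table = read_datasource_table(example)
print(table)
```

To check the real file, pass `open("datasource.tab")` in place of the in-memory example.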
In [5]:
import sys
In [6]:
sys.path.append("opencast-bio/")
In [7]:
import ocbio.extract
In [8]:
reload(ocbio.extract)
Now that the data directory has been annexed, the database files must first be unlocked:
In [9]:
!git annex unlock Y2H/Y2H.db
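Both unlock steps in this notebook use the same `git annex unlock` command, so when the notebook is run as a plain script they could be collected in one place. The sketch below is an illustration only: `build_unlock_command` and `unlock_all` are hypothetical helpers, not part of ocbio or git-annex. The `runner` parameter is there so the command list can be inspected without git-annex installed.

```python
import subprocess


def build_unlock_command(path):
    """Return the argument list for `git annex unlock <path>`."""
    return ["git", "annex", "unlock", path]


def unlock_all(paths, runner=subprocess.check_call):
    """Unlock each annexed file in turn.

    With the default runner, a non-zero exit status raises
    subprocess.CalledProcessError and stops the loop.
    """
    for path in paths:
        runner(build_unlock_command(path))


# Example call (requires git-annex and must run inside the annexed repository):
# unlock_all(["datasource.tab", "Y2H/Y2H.db"])
```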
In [10]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.tab", verbose=True)
In [11]:
assembler.regenerate(verbose=True)
Using a set of positive interactions found through the iRefIndex project (created in this notebook), we can create sets of positive and negative feature vectors with which to train the classifier:
In [12]:
assembler.assemble("iRefIndex/human.iRefIndex.positive.pairs.txt",
                   "features/human.iRefIndex.positive.vectors.txt", verbose=True)
In [13]:
assembler.assemble("iRefIndex/human.iRefIndex.negative.pairs.txt",
                   "features/human.iRefIndex.negative.vectors.txt", verbose=True)
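Once both vector files exist, they need to be combined into one labelled set before training. The sketch below shows one way this might be done. It assumes each line of a vectors file is a single tab-delimited feature vector, which this notebook does not confirm. The helper name `load_labelled_vectors`, the 1/0 label convention, and the demonstration values are all made up for illustration. Fields are kept as strings so that any missing-value markers survive untouched.

```python
import csv


def load_labelled_vectors(positive_handle, negative_handle):
    """Read two tab-delimited sources and return (rows, labels).

    Rows from the positive source get label 1, rows from the negative
    source get label 0 (an assumed convention). Blank lines are skipped.
    """
    rows, labels = [], []
    for handle, label in ((positive_handle, 1), (negative_handle, 0)):
        for fields in csv.reader(handle, delimiter="\t"):
            if not fields:
                continue
            rows.append(fields)
            labels.append(label)
    return rows, labels


# Made-up demonstration data standing in for the two vectors files.
positive = ["0.9\t1\t0.3\n", "0.7\t0\t0.8\n"]
negative = ["0.1\t0\t0.2\n"]
rows, labels = load_labelled_vectors(positive, negative)
print(labels)
```

With the real files, open handles to `features/human.iRefIndex.positive.vectors.txt` and `features/human.iRefIndex.negative.vectors.txt` would replace the two lists.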
In [14]:
assembler.assemble("forGAVIN/mergecode/OUT/edgelist.txt",
                   "features/human.activezone.txt", verbose=True)